Results 1 - 7 of 7
1.
2022 IEEE International Conference on Big Data, Big Data 2022 ; : 5182-5188, 2022.
Article in English | Scopus | ID: covidwho-2249032

ABSTRACT

The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well known that the major SARS-CoV-2 lineages are characterized by mutations that occur predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for determining whether a lineage is of concern. This understanding is crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable approach to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Moreover, Euclidean space is not considered the best choice for classification and clustering of biological sequences. For this purpose, we design a method that converts the protein (spike) sequences into a sequence similarity network (SSN). We can then use the SSN as input to classical graph mining algorithms for typical tasks such as classification and clustering to understand the data. We show that the proposed alignment-free method outperforms the current state-of-the-art (SOTA) method in terms of clustering results. Similarly, we achieve higher classification accuracy using the well-known Node2Vec-based embedding compared to other baseline embedding approaches. © 2022 IEEE.
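
As a rough illustration of the sequence similarity network idea in this abstract, the sketch below links two sequences whenever their k-mer sets are similar enough. The abstract does not say how similarity is measured, so the Jaccard measure, the `k` and `threshold` values, the function names, and the toy sequences are all assumptions made for illustration, not the authors' method.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_ssn(sequences, k=3, threshold=0.5):
    """Build an undirected graph as an adjacency dict.

    Nodes are sequence indices; an edge joins i and j when the Jaccard
    similarity of their k-mer sets is at least `threshold`.
    The similarity measure and threshold are illustrative assumptions.
    """
    sets = [kmer_set(s, k) for s in sequences]
    graph = {i: set() for i in range(len(sequences))}
    for i, j in combinations(range(len(sequences)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Hypothetical toy fragments, not real data.
toy = ["MFVFLVLL", "MFVFLVLP", "QQQQQQQQ"]
ssn = build_ssn(toy)
```

The resulting adjacency dict could then be handed to a graph library for node embedding or clustering; no alignment step is needed because only k-mer overlap is compared.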

2.
J Comput Biol ; 30(4): 432-445, 2023 04.
Article in English | MEDLINE | ID: covidwho-2188058

ABSTRACT

With the rapid spread of COVID-19 worldwide, viral genomic data are available on the order of millions of sequences in public databases such as GISAID. This Big Data creates a unique opportunity for analysis to support effective vaccine development for the current pandemic and to avoid or mitigate future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with data of this magnitude. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
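
A k-mer representation of the kind this abstract mentions can be sketched as a fixed-length count vector: one position per possible length-k substring over the amino acid alphabet. The value of `k`, the function name, and the toy fragment below are assumptions for illustration; the abstract does not give the authors' exact settings.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count every overlapping k-mer and return a fixed-length vector.

    The vector has len(alphabet) ** k entries, one per possible k-mer,
    regardless of how long the input sequence is. k-mers containing
    characters outside the alphabet are ignored.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Hypothetical short fragment used only to exercise the function.
vector = kmer_vector("MFVFLVLLPLVSSQ", k=3)
```

Because every sequence maps to a vector of the same length, the output can be passed directly to standard classifiers, which is what makes the representation usable on unaligned data.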


Subject(s)
COVID-19, SARS-CoV-2, Humans, SARS-CoV-2/genetics, COVID-19/epidemiology, COVID-19/genetics, Genome, Viral, Amino Acids/genetics
3.
Diagnostics (Basel) ; 12(12), 2022 Dec 15.
Article in English | MEDLINE | ID: covidwho-2163270

ABSTRACT

SARS-CoV-2 and Influenza-A can present similar symptoms. Computer-aided diagnosis can help facilitate screening for the two conditions, and may be especially relevant and useful in the current COVID-19 pandemic because seasonal Influenza-A infection can still occur. We have developed a novel text-based classification model for discriminating between the two conditions using protein sequences of varying lengths. We downloaded viral protein sequences of SARS-CoV-2 and Influenza-A with varying lengths (all 100 or greater) from the NCBI database and randomly selected 16,901 SARS-CoV-2 and 19,523 Influenza-A sequences to form a two-class study dataset. We used a new feature extraction function based on a unique pattern, HamletPat, generated from the text of Shakespeare's Hamlet, and a signum function to extract local binary pattern-like bits from overlapping fixed-length (27) blocks of the protein sequences. The bits were converted to decimal map signals from which histograms were extracted and concatenated to form a final feature vector of length 1280. The iterative Chi-square function selected the 340 most discriminative features to feed to an SVM with a Gaussian kernel for classification. The model attained 99.92% and 99.87% classification accuracy rates using hold-out (75:25 split ratio) and five-fold cross-validations, respectively. The excellent performance of the lightweight, handcrafted HamletPat-based classification model suggests that it can be a valuable tool for screening protein sequences to discriminate between SARS-CoV-2 and Influenza-A infections.

4.
8th IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2022 ; : 81-88, 2022.
Article in English | Scopus | ID: covidwho-2120513

ABSTRACT

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes the COVID-19 disease in humans, which has reached the scale of a global pandemic. Changes in the composition of the genome of the virus, in the form of mutations, can alter its ability to infect host cells. These modified forms of the virus are known as variants. The spike region of the SARS-CoV-2 genome has a crown-like structure, from which 'coronavirus' gets its name. In SARS-CoV-2, it has been noted that mutations occur disproportionately in the spike region, making this region important for distinguishing different variants. Since amino acids (of the spike protein sequence) are not in a numerical form, they are of no direct use to machine learning algorithms. Thus we use various embedding techniques to make such spike sequence data amenable to machine learning approaches. However, there is ongoing research to find better solutions to study these variants using classification. This paper presents a transformation for spike sequences, called Spike2Signal, to allow the classification of different variants of SARS-CoV-2 using deep learning algorithms. Spike2Signal converts spike sequences into a signal-like representation to allow classification by state-of-the-art time-series classifiers. Further, we transform this Spike2Signal representation into an image (Spike2Image) to allow the use of state-of-the-art image classifiers and compare these results with those obtained purely with Spike2Signal. In a wider comparison with existing feature engineering-based methods, we show that the Spike2Signal representation outperforms all baselines in predictive power. © 2022 IEEE.

5.
International Journal of Advanced Computer Science and Applications ; 13(8):530-538, 2022.
Article in English | Scopus | ID: covidwho-2025703

ABSTRACT

DNA sequence classification is one of the major challenges in biological data processing. The identification and classification of novel viral genome sequences greatly help in reducing the dangers of a viral outbreak like COVID-19. The more accurate the classification of these viruses, the faster a vaccine can be produced to counter them. Thus, more accurate methods should be utilized to classify the viral DNA. This research proposes a hybrid deep learning model for efficient viral DNA sequence classification. A genetic algorithm (GA) was utilized for weight optimization with a Convolutional Neural Network (CNN) architecture. Furthermore, Long Short-Term Memory (LSTM) as well as Bidirectional CNN-LSTM model architectures are employed. Encoding methods are needed to transform the DNA into a numeric format for the proposed model. Three different encoding methods for representing DNA sequences as input to the proposed model were tested: k-mer, label encoding, and one-hot vector encoding. Furthermore, an efficient oversampling method was applied to overcome the imbalanced dataset issues. The proposed GA-optimized CNN hybrid model using label encoding achieved the highest classification accuracy, 94.88%, compared with the other encoding methods. © 2022, International Journal of Advanced Computer Science and Applications. All Rights Reserved.
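
Two of the encodings named in this abstract, label encoding and one-hot encoding, can be sketched in a few lines. The integer assigned to each base and the function names are assumptions for illustration; the abstract does not state the mapping the authors used.

```python
# Fixed base order; the specific integer codes are an illustrative assumption.
NUCLEOTIDES = "ACGT"
BASE_TO_INDEX = {base: i for i, base in enumerate(NUCLEOTIDES)}


def label_encode(seq):
    """Map each base to one integer (A=0, C=1, G=2, T=3 in this sketch).

    Raises KeyError for characters outside ACGT, such as the ambiguity
    code N, which a real pipeline would need to handle explicitly.
    """
    return [BASE_TO_INDEX[base] for base in seq]


def one_hot_encode(seq):
    """Map each base to a 4-element vector with a single 1."""
    return [
        [1 if i == code else 0 for i in range(len(NUCLEOTIDES))]
        for code in label_encode(seq)
    ]


labels = label_encode("GATTACA")
one_hot = one_hot_encode("AC")
```

Label encoding yields one number per base and so keeps the input compact, while one-hot encoding avoids implying an ordering among bases at the cost of four values per position.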

6.
22nd Annual International Conference on Computational Science, ICCS 2022 ; 13353 LNCS:380-386, 2022.
Article in English | Scopus | ID: covidwho-1958890

ABSTRACT

Detecting and intercepting malicious requests are among the most widely used defenses against attacks in network security, especially in the severe COVID-19 environment. Most existing detection approaches, including blacklist character matching and machine learning algorithms, have been shown to be vulnerable to sophisticated attacks. To address these issues, a more general and rigorous detection method is required. In this paper, we formulate the problem of detecting malicious requests as a temporal sequence classification problem, and propose a novel deep learning model, namely GBLNet, girdling bidirectional LSTM with multi-granularity CNNs. By connecting the shallow and deep feature maps of the convolutional layers, the ability to extract malicious features is improved at a more detailed level. Experimental results on the HTTP dataset CSIC 2010 demonstrate that GBLNet can efficiently detect intrusion traffic with superior accuracy and evaluation speed compared with the state of the art. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

7.
2nd Workshop Reducing Online Misinformation through Credible Information Retrieval, ROMCIR 2022 ; 3138:27-47, 2022.
Article in English | Scopus | ID: covidwho-1871513

ABSTRACT

The processing, identification and fact-checking of online information has received a lot of attention recently. One of the challenges is that scandalous or "blown up" news tends to become viral, even when coming from unreliable sources. Particularly during a global pandemic, it is crucial to find efficient ways of determining the credibility of information. Fact-checking initiatives such as Snopes and FactCheck.org perform manual claim validation, but they are unable to cover all suspicious claims that can be found online; they focus mainly on the ones that have gone viral. Similarly, for the general user it is also impossible to fact-check every single statement on a specific topic. While a lot of research has been carried out in both claim verification and fact-check-worthiness, little work has been done so far on the detection and extraction of dubious claims, combined with fact-checking them using external knowledge bases, especially in the COVID-19 domain. Our approach involves a two-step claim verification procedure consisting of a fake news detection task in the form of binary sequence classification and fact-checking using the Google Fact Check Tools. We primarily work on medium-sized documents in the English language. Our prototype is able to recognize, on a higher level, the nature of fake news, even hidden in a text that seems credible at first glance. This way we can alert the reader that a document contains suspicious statements, even if no already validated similar claims exist. For more popular claims, however, multiple results are found and displayed. We manage to achieve an F1 score of 98.03% and an accuracy of 98.1% in the binary fake news detection task using a fine-tuned DistilBERT model. © 2022 Copyright for this paper by its authors.
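
For readers comparing the two metrics reported in this abstract, the sketch below computes accuracy and F1 for a binary task from confusion-matrix counts. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the paper's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, f1) for a binary classifier.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts for illustration only.
accuracy, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=8)
```

Accuracy counts all correct predictions, whereas F1 ignores true negatives, so the two can diverge noticeably when the classes are imbalanced.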
